DOCUMENT RESUME

ED 399 294                                        TM 025 609

AUTHOR        Johanson, George A.
TITLE         A Compromise Grading Model for Classroom Tests.
PUB DATE      Apr 92
NOTE          6p.; Paper presented at the Annual Meeting of the
              American Educational Research Association (San
              Francisco, CA, April 20-24, 1992).
PUB TYPE      Reports - Evaluative/Feasibility (142) --
              Speeches/Conference Papers (150)
EDRS PRICE    MF01/PC01 Plus Postage.
DESCRIPTORS   *Criterion Referenced Tests; Cutting Scores;
              *Educational Testing; Error of Measurement; *Grading;
              *Norm Referenced Tests; Sampling; Secondary
              Education; *Secondary School Students; Standards;
              *Test Construction; Test Use
IDENTIFIERS   *Compromise



ABSTRACT 

Most educational measurement texts distinguish 
between norm-referenced (NR), or relative, methods of assigning 
letter grades to objective test scores, and criterion-referenced 
(CR), or absolute, methods. Both NR and CR approaches have serious 
limitations in typical classroom situations, and neither approach, in 
its pure form, may be entirely suitable. An alternative method is 
proposed and illustrated with scores from 57 secondary school 
students taking a 26-item objectively scored test. The approach 
involved using a smoothed or fitted cumulative distribution and a 
ratio of standard errors to fix the slope of the line through the 
ideal cut-points. This is a modification of the method of C. H. Beuk 
(1984). The rationale for this type of compromise is that it 
acknowledges the sample status of both the set of test items and the 
group of examinees and shares sampling error equally between NR and 
CR methods. The algorithm has been programmed in PASCAL for the 
microcomputer. A structured grading method of this sort would allow 
teachers of multiple sections or those within the same department to 
give somewhat comparable grades to their students if they used 
agreed-on NR standards and individual CR standards. This compromise 
would be especially useful when an entirely new test is used or an 
unfamiliar group of students is encountered. (Contains 1 table, 2 
figures, and 10 references.) (SLD) 



Reproductions supplied by EDRS are the best that can be made
from the original document.




A COMPROMISE GRADING MODEL FOR CLASSROOM TESTS

George A. Johanson, College of Education, Ohio University

A Paper Presented at the American Educational Research Association Annual Meeting

San Francisco, April 1992








Objectives 

Methods for assigning letter grades to a set of objective test scores would seem to be a somewhat neglected area of technical
concern. Most educational measurement texts distinguish between norm-referenced (NR) or relative methods and domain/criterion-referenced
(CR) or absolute methods. However, both NR and CR approaches can be seen to have serious limitations for use in typical classroom situations.

In a rather comprehensive examination, at a relatively low cognitive level (for example, in mastery learning), and with relatively few
students, absolute standards such as percentage-correct seem more appropriate in the sense of yielding interpretable scores (e.g., percent
mastery of the content domain). If there are many students, testing is at a higher cognitive level, and the test is comprised of a less than
comprehensive set of items, then relative standards (such as z-scores) might seem more interpretable (e.g., percentile rank in the
population). That is, if the sample of students is large enough to be representative of the population and to yield accurate (percentile)
estimates of each student's relative position in the population, then NR may be a viable approach to grading. Conversely, if the content
domain is clearly defined (factual) as opposed to implied (higher level skills) and the sample of items is large enough to accurately reflect
the content domain and to yield accurate estimates of each student's score (proportion of items correct), then CR may be an appropriate
grading method.

In a usual classroom situation, there might be 20-60 examinees with 20-60 test items at a variety of cognitive levels and therefore
neither approach, in its pure form, seems particularly well-suited. Recognizing these and other factors, Terwilliger (1989) recommends using
CR for some grading decisions and NR for others.

A compromise between NR and CR seems both reasonable and consistent with current practice. Consistent, in that many teachers use
absolute standards in the form of percent-correct (sometimes because of school or district policy), but then 'adjust' the raw scores in a
variety of ways if the distribution of grades seems inappropriate or improbable. Indeed, some teachers relying solely on NR grading are quick
to admit that they also have CR 'limits' and will, for example, not award an 'A' to any score below a certain percentage-of-items-correct. A
compromise is reasonable since a valid interpretation of a CR or a NR grade requires knowledge of either the content domain or the population
of students, respectively. The blending of the two approaches might better reflect the actual partial knowledge of both the content domain
and population by the typical consumer of the grade. Indeed, grades are often seen to reflect, to some extent, both absolute and relative
achievement.

From the responses of students in my measurement classes, the most popular procedures for adjusting the scores would seem to be
'gapping' or 'eyeballing' (Wainer & Schacht, 1978), adding a fixed number (or percentage) of points to everyone's score, dropping items, or
simply making sure the next test results in grades with a compensating distribution. It is somewhat ironic that if both the mean and standard
deviation of a domain-referenced test are adjusted using a linear transformation then we have the equivalent of a most prevalent form of
norm-referencing, the z-score. The focus of this paper is on an alternative method of adjustment.
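
The remark about linear transformations can be checked numerically. The following Python sketch is an illustration only: the score list and the target mean and standard deviation are made-up values, not data from this paper. It rescales a set of percent-correct scores to a chosen mean and standard deviation and confirms that every student's z-score is unchanged, so grades based on the adjusted scores carry the same information as grades based on z-scores.

```python
from statistics import mean, pstdev

def z_scores(scores):
    """Return the z-score of every score (population SD)."""
    m, s = mean(scores), pstdev(scores)
    return [(x - m) / s for x in scores]

def rescale(scores, target_mean, target_sd):
    """Linearly transform scores so they have the target mean and SD."""
    m, s = mean(scores), pstdev(scores)
    return [target_mean + target_sd * (x - m) / s for x in scores]

# Hypothetical percent-correct scores for a small class (illustration only).
raw = [52.0, 58.0, 61.0, 65.0, 70.0, 74.0, 81.0]
adjusted = rescale(raw, target_mean=80.0, target_sd=8.0)

# Relative standing is untouched by the adjustment.
z_raw = z_scores(raw)
z_adj = z_scores(adjusted)
same = all(abs(a - b) < 1e-9 for a, b in zip(z_raw, z_adj))
print(same)
```

Any adjustment of this kind with a positive scale factor preserves the rank order and the z-scores, which is why it amounts to norm-referencing under another name.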

Theoretical Framework and Example with Real Data

Hofstee (1983) has suggested using a cumulative frequency distribution to better see the relationship between NR and CR
decision-making. While Hofstee did this in reference to large-scale testing, some of the principles involved apply equally well to the
classroom situation. Figure 1 is an example of a cumulative number-correct frequency distribution. The scores are from 57 secondary
students taking a 26-item objectively-scored test. A score of 15 would be approximately at the 45th percentile. Since there were 5 grading
categories: A, B, C, D, F, we can identify the expected outcomes by locating points 1-4 using both the absolute standards we have set and our
past grading practice with this unit of study. That is, if we have observed, over many sections, that 16% of the students received A's, 23%
received B's, 18% received C's, 20% received D's, and 23% received F's, then the NR standards may be seen on the vertical axis where the
cumulative proportion of students below each number-correct score is given. If it is felt that 93% or more of the items must be answered
correctly to receive an A, 85% or more to receive a B, 75% or more to receive a C, and 65% or more to receive a D, then the corresponding
number-correct standards can be seen on the horizontal axis. The intersections of the expected cut-points (1-4) are such that both the NR and
CR expectations are simultaneously met if and only if the observed distribution passes through these points. That is, if these intersections
are on the observed cumulative distribution, then we are done; if not, some decision or compromise is necessary. In a sense, the
intersections are points on our best estimate of a population cumulative distribution.
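
The ideal points 1-4 follow directly from the stated standards, and a short Python sketch makes the arithmetic explicit. The variable and function names here are illustrative choices, not taken from the author's Pascal program. Each point pairs a CR standard converted to the number-correct scale (horizontal axis) with the cumulative proportion of students expected below that grade boundary (vertical axis).

```python
K_ITEMS = 26  # test length from the example

# CR standards: minimum proportion of items correct for each grade.
CR_MIN = {"A": 0.93, "B": 0.85, "C": 0.75, "D": 0.65}
# NR standards: proportion of students historically receiving each grade.
NR_SHARE = {"A": 0.16, "B": 0.23, "C": 0.18, "D": 0.20, "F": 0.23}

def ideal_cut_points(cr_min, nr_share, k_items):
    """Return points 1-4 as (boundary, number_correct, cumulative_proportion)."""
    points = []
    cumulative = 0.0
    # Walk upward from the lowest grade; the share below each boundary accumulates.
    for lower, upper in [("F", "D"), ("D", "C"), ("C", "B"), ("B", "A")]:
        cumulative += nr_share[lower]
        x = cr_min[upper] * k_items
        points.append((f"{lower}/{upper}", round(x, 2), round(cumulative, 2)))
    return points

points = ideal_cut_points(CR_MIN, NR_SHARE, K_ITEMS)
for number, (boundary, x, y) in enumerate(points, start=1):
    print(number, boundary, x, y)
```

The resulting coordinates agree with the values marked on the axes of Figure 1 (16.9, 19.5, 22.1, and 24.2 after rounding on the horizontal axis; 0.23, 0.43, 0.61, and 0.84 on the vertical axis).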

The particular compromise suggested by Hofstee involves setting minimum and maximum acceptable percentages-correct about each
CR cut-point and minimum and maximum acceptable proportions of students in each grading category about each NR cut-point to determine
the slope of a diagonal line through the meeting point. This line is then extended to meet the observed cumulative distribution and the
intersection is the compromise.

Beuk (1984) proposed that the compromise be obtained by using the ratio of standard deviations of the ratings of a group of judges
as the slope of the diagonal. De Gruijter (1985) uses estimates of the uncertainties concerning both the NR and CR ideals to define a family
of ellipses, selects the tangent ellipse, and uses the abscissa of the intersection as the compromise cut-score.

These procedures all require additional judgments and do not directly take into consideration the notion that, all else being equal,
with larger numbers of examinees and fewer items it might be reasonable to depend more heavily upon the NR criteria and vice-versa. That is,
if we have a primarily CR grading philosophy, then we would be concerned that there are a sufficient number of items to adequately represent a
clearly identified content domain and to permit accurate estimation of a student's domain score. If primarily NR, the concern would be to
have a sufficient number of students (representative of the population) in the sample to accurately estimate a student's percentile rank. In
a compromise situation, more weight might be given to the more accurate estimation at each decision level.

An additional problem is encountered using the observed cumulative distribution. As the ratio of test length to number of students
increases, there will be more and larger gaps or zero frequencies in the frequency distribution and these will be seen as 'flat spots' in the
cumulative distribution. Due to these random gaps and other sample fluctuations, the need to smooth the observed distribution arises. While
there are a number of smoothing approaches available, the beta-binomial (or negative hypergeometric) model has been found to be a most
efficient presmoother for equipercentile equating (Fairbank, 1987) and has also been successfully used to model number-correct achievement
test score data (Duncan, 1974; Keats & Lord, 1962). Lord and Novick (1968) recommend this model for fitting observed distributions of
number-correct scores and provide a theoretical rationale for the model. A convenient algorithm for computing the beta-binomial is available
(Huynh, 1979).
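
As a rough illustration of beta-binomial smoothing, the Python sketch below fits the two shape parameters by the method of moments and builds a smoothed cumulative distribution. This is an assumption made for illustration: the paper does not state how the parameters were estimated, and the sketch does not reproduce Huynh's (1979) algorithm. The inputs are the mean (15.47), standard deviation (4.16), and test length (26) reported for the example data; all function names are hypothetical.

```python
from math import exp, lgamma

def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def fit_beta_binomial(mean, sd, k):
    """Method-of-moments estimates (alpha, beta) for a k-item test."""
    p = mean / k
    binomial_var = k * p * (1.0 - p)
    # Intraclass correlation implied by the extra-binomial variance.
    rho = (sd ** 2 / binomial_var - 1.0) / (k - 1)
    if rho <= 0.0:
        raise ValueError("scores are not overdispersed; moment fit undefined")
    total = 1.0 / rho - 1.0  # alpha + beta
    return p * total, (1.0 - p) * total

def beta_binomial_pmf(x, k, alpha, beta):
    """Probability of exactly x items correct out of k."""
    log_choose = lgamma(k + 1) - lgamma(x + 1) - lgamma(k - x + 1)
    return exp(log_choose + log_beta(x + alpha, k - x + beta) - log_beta(alpha, beta))

def smoothed_cdf(k, alpha, beta):
    """Cumulative proportion at or below each score 0..k."""
    cdf, running = [], 0.0
    for x in range(k + 1):
        running += beta_binomial_pmf(x, k, alpha, beta)
        cdf.append(running)
    return cdf

K = 26
alpha, beta = fit_beta_binomial(mean=15.47, sd=4.16, k=K)
cdf = smoothed_cdf(K, alpha, beta)
print(round(alpha, 2), round(beta, 2), round(cdf[15], 3))
```

Unlike the observed cumulative distribution, the fitted one rises strictly at every score, so it has no 'flat spots' where a diagonal line could meet it ambiguously.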



Method 

The approach followed was to use a smoothed or fitted cumulative distribution and to use a ratio of standard errors to fix the slope
of the line through the ideal cut-points. This is a modification of Beuk's method in that his ratio of standard deviations is also a ratio of
standard errors (the same judges are used to provide both standard deviations). It is important to note, however, that there is no variation
in our ideal cut-points; these points may be thought of as being akin to population parameters.

If we conceptually fix a CR standard and the test, then each sample of students or class from our (assumed infinite) population of
students will yield a sample proportion of students at or below this CR standard and this sample proportion may be compared to the
hypothesized (population) proportion. The standard error of such proportions is given by $(\pi_n(1-\pi_n)/n)^{0.5}$ where $n$ is the number of examinees
and $\pi_n$ is the population proportion or CR standard. In exactly the same way, we may imagine a single NR standard (proportion of items
correct = $\pi_k$) and class or group of students as fixed and compute the standard error of the proportion of items answered correctly,
$(\pi_k(1-\pi_k)/k)^{0.5}$, as if the $k$ items on our test were a sample from the (assumed infinite) content domain. The compromise is to use the ratio
of these standard errors, $[(\pi_n(1-\pi_n)/n)^{0.5}]/[(\pi_k(1-\pi_k)/k)^{0.5}]$, as the slope of the line through the ideal points.
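
The ratio of standard errors is straightforward to compute. The Python sketch below uses illustrative function names, not the author's code, and evaluates the ratio at each of the four ideal points for 57 examinees and 26 items, taking the cumulative NR proportions (0.23, 0.43, 0.61, 0.84) and the CR standards (0.65, 0.75, 0.85, 0.93) from the example. Because $\pi(1-\pi)$ is symmetric about 0.5, using 0.61 or its complement 0.39 gives the same standard error.

```python
from math import sqrt

def se_proportion(pi, size):
    """Standard error of a sample proportion: sqrt(pi * (1 - pi) / size)."""
    return sqrt(pi * (1.0 - pi) / size)

def slope_ratio(pi_n, n_students, pi_k, k_items):
    """Ratio SE(student proportion) / SE(item proportion) at one ideal point."""
    return se_proportion(pi_n, n_students) / se_proportion(pi_k, k_items)

N_STUDENTS, K_ITEMS = 57, 26
# (pi_n, pi_k) for ideal points 1-4, from the standards stated in the example.
IDEAL = [(0.23, 0.65), (0.43, 0.75), (0.61, 0.85), (0.84, 0.93)]

ratios = [slope_ratio(pn, N_STUDENTS, pk, K_ITEMS) for pn, pk in IDEAL]
for point, r in enumerate(ratios, start=1):
    print(point, round(r, 3))
```

With more examinees than items, the student-proportion standard error is the smaller of the two at every point, so each ratio falls below 1.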

The rationale for this type of compromise is that it acknowledges the sample status of both the set of test items and the group of
examinees and shares the sampling error equally between methods (NR and CR). In particular, the sample of items is given the same credibility
as the sample of students in that the compromise at each decision level departs from the NR and CR standards by the same number of standard
errors.

In practice, standard errors are largely influenced by sample size and this means that when the ratio of number of items to number
of examinees is large, the tendency will be for the compromise to rely more heavily on the CR standards. When the ratio is small, the
compromise will rely more heavily on the NR standards. That is, reliance is placed on both NR and CR standards, but the compromise at each
decision point tends to proportionally favor the standard with the smaller standard error.

In the example, the CR cut-point between a grade of B and C was to be a percentage-correct score of 85%. The standard error of a
proportion of items is $(\pi_k(1-\pi_k)/k)^{0.5}$ where $k$ is the number of items on the test or, $(0.85(1-0.85)/26)^{0.5} = 0.070$. The corresponding
standard error of a proportion of persons below a grade of B is $(\pi_n(1-\pi_n)/n)^{0.5}$ where $n$ is the number of examinees or, $(0.39(1-0.39)/57)^{0.5}
= 0.065$. The resulting ratio of 0.065/0.070 = 0.923 would be negated, converted to the number-correct scale, and used as the slope of the
line through point 3, see Figure 2. Linear interpolation is then used with the smoothed cumulative distribution and the resulting abscissa of
the point of intersection (b) is 18.047. This is the suggested compromise B/C cut-score shown in Figure 2 for grading purposes with this test
given the CR and NR parameters. The calculations for the other three cut-points are similar. Note that the slopes are all less than 1 for
this example. This is the result of somewhat greater reliance on the NR standards than on the CR standards in arriving at the compromise
since there were 57 students and 26 test items. The ratio of standard errors, however, is not just the ratio of number of persons to number
of items, but also reflects the expected proportions.
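
The intersection step can be sketched in Python under several stated assumptions. First, the unsmoothed observed distribution from Table 1 stands in for the beta-binomial fit, so the result is not expected to reproduce 18.047 exactly. Second, "converted to the number-correct scale" is read as dividing the ratio by the number of items. Third, the cumulative distribution is taken as the proportion at or below each score, interpolated linearly between integer scores. The function names are hypothetical.

```python
from math import sqrt

# Observed frequencies from Table 1 (number-correct score -> number of students).
FREQ = {5: 1, 8: 2, 9: 1, 10: 5, 11: 2, 12: 2, 13: 5, 14: 5, 15: 3,
        16: 6, 17: 8, 18: 2, 19: 6, 20: 2, 21: 2, 22: 3, 23: 2}
K_ITEMS = 26
N_STUDENTS = sum(FREQ.values())

def observed_cdf(freq, k_items):
    """Proportion of students at or below each score 0..k (unsmoothed)."""
    total = sum(freq.values())
    cdf, running = [], 0
    for score in range(k_items + 1):
        running += freq.get(score, 0)
        cdf.append(running / total)
    return cdf

def compromise_cut(cdf, x0, y0, slope):
    """Abscissa where the line through (x0, y0) meets the interpolated CDF."""
    def gap(i):
        # CDF minus line at integer score i; both are linear between integers.
        return cdf[i] - (y0 + slope * (i - x0))
    for i in range(len(cdf) - 1):
        g0, g1 = gap(i), gap(i + 1)
        if g0 <= 0.0 < g1:
            return i + (-g0) / (g1 - g0)
    raise ValueError("line does not cross the cumulative distribution")

# Slope for ideal point 3: negated ratio of standard errors, per item.
se_students = sqrt(0.39 * (1 - 0.39) / N_STUDENTS)
se_items = sqrt(0.85 * (1 - 0.85) / K_ITEMS)
slope = -(se_students / se_items) / K_ITEMS

cut = compromise_cut(observed_cdf(FREQ, K_ITEMS), x0=0.85 * K_ITEMS, y0=0.61, slope=slope)
print(round(cut, 3))
```

Even with the unsmoothed distribution, the crossing falls between scores of 18 and 19, the same interval as the reported compromise cut-score.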

By using a constant times the ratio of standard errors, we could adjust these cut-scores to yield any desired weighting of NR-CR
standards, perhaps to better reflect the cognitive level of the majority of the test items. It is interesting to note that setting the
constant (and hence the slope) to a value near zero results in an equipercentile equating of the smoothed observed score distribution to the
'norming group' distribution defined by the ideal points. This would seem to be a reasoned method of relative grading if the ideal points
were derived from data over many sections.

To make the scores more understandable to students and others, it may be desirable to follow the popular practice of presenting the
results as adjusted raw scores or adjusted percentages that can then be compared to the stated standards for letter grade decisions. The
scores, percentages, adjusted scores, and adjusted percentages are shown in Table 1. Letter grades for this example are also shown in Table 1
where the NR letter grades were calculated using z-scores with cut-scores that reflect the expected percentages of A's, B's, and so on. The
suggested or compromise letter grades are in the last column labeled NR/CR. Note how the NR/CR grades mediate the NR and CR grades somewhat
differently at each score level. This procedure is not equivalent to a simple 'averaging' of NR and CR grades.

For the example data, the mean number-correct score is 15.47, the standard deviation is 4.16, and the reliability (KR-21) is 0.66.
Using a Kolmogorov-Smirnov one sample test of fit, the maximum absolute difference is 0.095. The null hypothesis (model fits the data) is
accepted at p = 0.985. The beta-binomial has successfully fit (conservatively, at α = 0.20) over 95% of real data sets so far investigated
and has fit 100% of the author's classroom data for the past two years.
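
The summary statistics can be recomputed from the frequencies in Table 1. The Python sketch below uses the standard KR-21 formula, $\frac{k}{k-1}\left(1 - \frac{\bar{x}(k-\bar{x})}{k s^2}\right)$, and a sample variance with an $n-1$ denominator. The choice of denominator is an inference, made because it matches the reported standard deviation and z-scores, and is not stated in the paper.

```python
from math import sqrt

# Observed frequencies from Table 1 (number-correct score -> number of students).
FREQ = {5: 1, 8: 2, 9: 1, 10: 5, 11: 2, 12: 2, 13: 5, 14: 5, 15: 3,
        16: 6, 17: 8, 18: 2, 19: 6, 20: 2, 21: 2, 22: 3, 23: 2}
K_ITEMS = 26

def summary(freq, k_items):
    """Return (n, mean, sample SD, KR-21) for a frequency table of scores."""
    n = sum(freq.values())
    mean = sum(x * f for x, f in freq.items()) / n
    # Sample variance (n - 1 denominator), assumed to match the reported SD.
    var = sum(f * (x - mean) ** 2 for x, f in freq.items()) / (n - 1)
    kr21 = (k_items / (k_items - 1)) * (1.0 - mean * (k_items - mean) / (k_items * var))
    return n, mean, sqrt(var), kr21

n, mean, sd, kr21 = summary(FREQ, K_ITEMS)
print(n, round(mean, 2), round(sd, 2), round(kr21, 2))  # 57 15.47 4.16 0.66
```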

The algorithm has been programmed in (standard) Pascal for the IBM microcomputer. There are several additional outputs, the program
can be run in batch mode or interactively, and there is an accompanying document. It is available without cost from the author when the
request is accompanied by a formatted disk and stamped mailer.

Conclusions and Educational Importance

Grades are important: they are the coin-of-the-realm in education. Many teachers find the task of evaluation difficult and might
welcome a structured method for obtaining, at least, suggested letter grades in those situations where adherence to absolute standards would
result in an unacceptable distribution of letter grades.

Continuing to adjust proportion-correct standards by ad hoc methods is neither reasoned nor reliable. Structured grading methods
such as this would also permit teachers of multiple sections or those within the same department to give somewhat comparable grades to their
students if they used agreed-upon NR standards and individual CR standards that reflect professional judgment about differences in the
difficulties and objectives of their individual tests. The use of a computer to assist in grading decisions means that practical and usable
approaches need not be overly simplistic. This compromise might prove most useful when an entirely new test is used or when an unfamiliar
group of students is encountered. When a teacher is obliged to adhere to grading standards as in the example (93% and above for an 'A'),
giving tests that challenge all students and that reflect higher level cognitive skills becomes virtually impossible without some means of
score adjustment. 'Eyeballing' a set of scores is simply not good grading practice.



References 



Beuk, C. H. (1984). A method for reaching a compromise between absolute and relative standards in examinations. Journal of Educational
Measurement, 21, 147-152.

De Gruijter, D. N. M. (1985). Compromise models for establishing examination standards. Journal of Educational Measurement, 22, 263-269.

Duncan, G. T. (1974). An empirical Bayes approach to scoring multiple-choice tests in the misinformation model. Journal of the American
Statistical Association, 69, 50-57.

Fairbank, B. A. (1987). The use of presmoothing and postsmoothing to increase the precision of equipercentile equating. Applied Psychological
Measurement, 11, 245-262.

Hofstee, W. K. B. (1983). The case for compromise in educational selection and grading. In S. B. Anderson & J. S. Helmick (Eds.), On
educational testing (pp. 109-127). San Francisco: Jossey-Bass.

Huynh, H. (1979). Statistical inference for two reliability indices in mastery testing based on the beta-binomial model. Journal of
Educational Statistics, 4, 231-246.

Keats, J. A., & Lord, F. M. (1962). A theoretical distribution for mental test scores. Psychometrika, 27, 59-72.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.

Terwilliger, J. S. (1989). Classroom standard setting and grading practice. Educational Measurement: Issues and Practice, 8, 15-19.

Wainer, H., & Schacht, S. (1978). Gapping. Psychometrika, 43, 203-212.



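Figure 1 Cumulative distribution of the observed number-correct
raw scores.

[Figure not reproduced. Vertical axis: cumulative proportion of students (0.0 to 1.0), with the NR cut-points marked at 0.23, 0.43,
0.61, and 0.84. Horizontal axis: observed number-correct raw scores (5 to 25), with the CR cut-points marked at 16.9, 19.5, 22.1, and 24.2.]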










Figure 2 Cumulative distribution of the smoothed number-correct 
raw scores. 




Table 1 Example Data With NR, CR, and Compromise Grades

 X   Freq     %    Adj. X   Adj. %   z-score   NR   CR   NR/CR
23     2    88.5   24.81    95.44     1.808    A    B     A
22     3    84.6   24.42    93.91     1.568    A    C     A
21     2    80.8   23.93    92.05     1.328    A    C     B
20     2    76.9   23.31    89.66     1.087    A    C     B
19     6    73.1   22.69    87.27     0.847    B    D     B
18     2    69.2   22.05    84.80     0.607    B    D     C
17     8    65.4   20.97    80.67     0.367    B    D     C
16     6    61.5   19.90    76.53     0.126    C    F     C
15     3    57.7   18.87    72.57    -0.114    C    F     D
14     5    53.8   17.87    68.72    -0.354    D    F     D
13     5    50.0   16.86    64.84    -0.594    D    F     F
12     2    46.2   15.56    59.85    -0.835    F    F     F
11     2    42.3   14.26    54.86    -1.075    F    F     F
10     5    38.5   12.97    49.87    -1.315    F    F     F
 9     1    34.6   11.67    44.89    -1.555    F    F     F
 8     2    30.8   10.37    39.90    -1.796    F    F     F
 5     1    19.2    6.48    24.94    -2.516    F    F     F

Note. X is the raw number-correct score; NR/CR is
the compromise grade.


